Abstract. The ADS platform is undergoing the biggest rewrite of its 20-year history.
While several components have been added to its architecture over the past couple of 
years, this talk will concentrate on the underpinnings of ADS’s search layer and its 
API. To illustrate the design of the components in the new system, we will show how 
the new ADS user interface is built exclusively on top of the API using RESTful web 
services. Taking one step further, we will discuss how we plan to expose the treasure 
trove of information hosted by ADS (10 million records and fulltext for much of the 
Astronomy and Physics refereed literature) to partners interested in using this API. This 
will provide you (and your intelligent applications) with access to ADS’s underlying 
data to enable the extraction of new knowledge and the ingestion of these results back 
into the ADS. Using this framework, researchers could run controlled experiments with 
content extraction, machine learning, natural language processing, etc. In this talk, we 
will discuss what is already implemented, what will be available soon, and where we 
are going next. 


1. Introduction 


The SAO/NASA Astrophysics Data System (Kurtz et al. 2000) has gone through
significant changes over the course of its 20-year career, but its purpose has always been
to provide relevant information and services to the scientific community. The ADS has
achieved this goal through constant innovation and adoption of new technologies. In
this paper we will describe the changes that are happening to the ADS in its recent
incarnation. They constitute the biggest rewrite in the history of the system and mark the
opening of many new possibilities, particularly in discovery capabilities and support
for bibliometric studies.


2. Environment in which the ADS operates 


The ADS is an information system built by and for scientists to provide access to
scientific literature. Its user base is considerable: at any given time there
are ~200-300 active users on the website (which amounts to ~1 million unique
sessions per month), although its core user base is much smaller (Kurtz et al. 2009).
For the majority of these (anonymous) users worldwide, the ADS is the destination they
reached after following a link from Wikipedia or Google to one of its records. However,
for about 50,000 scientists worldwide, the ADS represents a gateway to information.
They use the system every day (or every other day). And for the ADS they are the users
who really care and need to be cared for.

The ADS provides its services within a greater information space environment,
which includes several service providers. One such provider is the arXiv, arguably
one of the most indispensable academic services, at least in the fields of astronomy and
physics. It has become a de facto publishing standard to post pre-prints on the arXiv
before they are available as published and reviewed papers through more traditional
channels. However, like many things that we take for granted and expect to be
there, the arXiv operates with a very limited budget and resources.

The arXiv (and its volunteers, the scientists) performs the hard work of pre-filtering
submissions and categorizing them. It serves one purpose and serves it well, yet
it needs other projects to complement it. One such project is the ADS, which collaborates
closely with arXiv and downloads and indexes every single arXiv submission. It is
perhaps a little-known fact, but you can use the powerful search capabilities of the ADS
to sift through all arXiv fulltexts.

Throughout its existence, the ADS has established such symbiotic relationships in
order to provide the best service to the scientific community. It collaborates not only
with arXiv, but also with data repositories, publishers, and the people behind the
production of journals and conferences. It has linked pre-prints to the published papers
and other grey literature, so that they do not disappear into obscurity. Tracking the
important channels of scientific information logically led to demands for indicators of
scientific contribution (the citation indicators), which in turn get (mis)used as metrics
of scientific success. However we may look at it, these indicators remain important tools
and the ADS recognizes their importance. In addition to scientists, who come in their
roles of researchers and readers, there are also administrators and evaluators, librarians,
as well as technocrats and grant agencies who use the ADS database for other,
non-scientific purposes (Kurtz et al. 2005; Accomazzi et al. 2007).

The reason we mention these different user populations is to explain the context in
which the ADS operates. It has to choose wisely which services are feasible to implement
and maintain. Given limited resources, this becomes not only a technological, but
also a sociological challenge.

The discussion becomes simpler if we consider the entities (objects) processed by
the ADS. As an information system, the ADS primarily processes scientific literature,
so the most frequent data types that it deals with are:


• Preprints and journal articles: The ADS stores them in the database (their
fulltext content as well as attached objects such as extracted images).

• Metadata: Metadata can be very expensive to create and curate but its use may
be surprisingly powerful. For example, if author affiliations are known, one can
compile cumulative reports about the scientific output of a certain institute,
department and/or university (even country or continent).

• Citation information: Citing and cited by are very powerful pieces of information;
they are the basis of the citation index and the metrics available throughout
the ADS. Its use in the ADS is so important that in the new version of the search
engine it has been incorporated into the index, so you can search for citations in
the same way as you search for papers.

• Usage information: This is information about readers, what they are reading,
what papers are popular, downloaded, consulted and how often.^1

Finally, there is raw data, or in the case of the ADS rather links to science data.
These consist of a network of links between bibliographic records hosted by the ADS
and data products, astronomical objects, and electronic articles hosted elsewhere. Other
projects (such as NED and the CDS) specialize in collecting this information and do a
much better job with it than the ADS could. The role of the ADS here is to
serve as a gateway and make these resources seen and used.

And by that we mean seen and used not just by people who sit behind computer
screens looking at the ADS web interface. To the ADS developers, those people are
merging with the machines (robots) that are starting to access the new ADS infrastructure.
This has nothing to do with de-humanization or other negative connotations of
information processing. As far as information systems are concerned, it has become
impractical to treat users differently from machines. The differences between the humans
and the clients (robots, programs, machines) that access the ADS on the users' behalf are
disappearing, and this brings us to the new era and new architecture of the ADS. For a few
years, the system has been preparing itself for a change, and it is not exactly evolution
but rather revolution.


3. New architecture and the API 

While the old infrastructure of the ADS is similar to a huge building which hosts every
department and which has only one entrance (and only one power supply), the new
infrastructure is much more distributed and lightweight, more similar to a town with
roads connecting different buildings. The central component here is the Application
Programming Interface (API) layer^2; we have split the internal subsystems into
microservices. They are independent, standalone web services that communicate with
each other through the API using the REST and OAuth protocols (and soon also
exclusively through HTTPS). Every service, even the ones that we consider the most
critical, is connected to the API and available through it; it can be accessed from
anywhere on the internet provided that the client has the proper authorization token.
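
As a rough illustration of the access pattern described above, the following Python sketch builds (but does not send) an authorized request to a search microservice. The host name, the `/search/query` path, the `q` parameter and the bearer-token header scheme are assumptions made purely for illustration; they are not taken from this paper, and the actual values should be looked up in the API documentation.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical values: the real host, path and parameter names may differ.
API_BASE = "https://api.example.org/v1"
SEARCH_PATH = "/search/query"


def build_search_request(query, token, base=API_BASE):
    """Return an unsent GET request that carries an OAuth-style bearer token."""
    url = f"{base}{SEARCH_PATH}?{urlencode({'q': query})}"
    return Request(url, headers={"Authorization": f"Bearer {token}"})


# The token below is a placeholder, not a real credential.
request = build_search_request("author:einstein", token="my-secret-token")
print(request.full_url)
print(request.get_header("Authorization"))
```

Keeping request construction separate from sending makes it easy to inspect the URL and headers before any network traffic happens; the same request object could then be passed to `urllib.request.urlopen`.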

The microservices are virtualized, currently running on the Amazon cloud. They
can be scaled horizontally: if there is need for more power, we add computational
nodes to the cloud. Using this new organization, the ADS can more easily add new
functionality (new services), and it is relatively easy to alter existing services.
Basically, implementations can be swapped at will, provided the semantics of the API
remain unchanged. It is still challenging to find the absolute minimal set of the API,
but the system has already proven itself. It simplifies maintenance and allows for rapid
development.^3
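
The idea that implementations can be swapped while the API semantics stay fixed can be sketched in Python as follows. This is only a toy model: the `SearchService` contract and both classes are hypothetical names invented for this example and do not correspond to actual ADS components.

```python
from __future__ import annotations

from typing import Protocol


class SearchService(Protocol):
    """The contract clients rely on: titles containing a word, sorted."""

    def search(self, word: str) -> list[str]: ...


class LinearScanSearch:
    """First implementation: scans every title on each query."""

    def __init__(self, titles: list[str]) -> None:
        self._titles = list(titles)

    def search(self, word: str) -> list[str]:
        needle = word.lower()
        return sorted({t for t in self._titles if needle in t.lower().split()})


class InvertedIndexSearch:
    """Replacement implementation: same results, precomputed word index."""

    def __init__(self, titles: list[str]) -> None:
        self._index: dict[str, set[str]] = {}
        for title in titles:
            for token in title.lower().split():
                self._index.setdefault(token, set()).add(title)

    def search(self, word: str) -> list[str]:
        return sorted(self._index.get(word.lower(), set()))


def count_hits(service: SearchService, word: str) -> int:
    """Client code depends only on the contract, not on the implementation."""
    return len(service.search(word))


titles = ["Weak lensing surveys", "Strong lensing by clusters", "Stellar winds"]
for service in (LinearScanSearch(titles), InvertedIndexSearch(titles)):
    print(type(service).__name__, service.search("lensing"))
```

Both classes return identical results for the same query, so a client such as `count_hits` cannot tell which one is deployed.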


^1 This information is properly anonymized before it gets used by different algorithms and subsystems of
the ADS.

^2 The API layer is itself published on GitHub, at http://github.com/adsabs/adsws

^3 As will be discussed later, this also has negative consequences: at the moment, the ADS is changing so
rapidly that it is hard to keep the documentation up to date. It requires a certain mental discipline and
robust practices, and we are still struggling to find the optimum balance.

3.1. The search

The core of the new ADS (as in the previous generation of the ADS) is centered
around search. The search engine is built on a heavily modified and extended version
of Apache SOLR (http://lucene.apache.org/solr/). As one would probably
expect, it has many advantages over the in-house custom search service developed for
ADS Classic 20 years ago. Unsurprisingly, it is able to address much bigger memory
spaces, which translates into an ability to index bigger volumes of data. For the first
time in ADS history, it is possible to search through fulltext content together with the
metadata. Additionally, we have integrated the citation and co-readership networks
into the index, so you can now simultaneously explore all three different search spaces:
metadata, fulltext content, and citations.

The new search engine provides many features, some of which will be totally new
to ADS users, for example:

• proximity search - the ability to find a group of tokens based on their distance, e.g.
body:(weak NEAR5 lensing), which finds phrases where weak appears within 5
words of lensing. This query would find the phrase "weak lensing", as well as "weak
gravitational lensing" in the fulltext

• similarity search - (e.g. author:eisenstein~0.3) will find different spellings of
the author name based on a similarity metric (this can be used to correct
for misspellings)

• regular expressions - most often used by curators to find terms that follow a
particular pattern, e.g. grant numbers or astronomical entities
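
To make the proximity idea concrete, here is a small Python sketch of what a query such as body:(weak NEAR5 lensing) asks for. It is a naive whitespace-token check written for illustration, not the way Lucene or the ADS evaluates proximity queries; the function name `within_distance` is hypothetical.

```python
def within_distance(text, first, second, max_distance):
    """Return True if `first` and `second` occur within `max_distance` tokens."""
    tokens = text.lower().split()
    first_positions = [i for i, token in enumerate(tokens) if token == first]
    second_positions = [i for i, token in enumerate(tokens) if token == second]
    # Compare every occurrence of one term with every occurrence of the other.
    return any(
        abs(i - j) <= max_distance
        for i in first_positions
        for j in second_positions
    )


print(within_distance("a study of weak gravitational lensing", "weak", "lensing", 5))
print(within_distance("weak signals far away from any mention of lensing", "weak", "lensing", 5))
```

The first call matches because the two terms are two tokens apart; the second does not because they are eight tokens apart.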

The features described above are provided by the Lucene search library upon
which SOLR is built. In addition, ADS has extended Lucene capabilities by
implementing its own query language, modifying the traditional search syntax and
enhancing it with new search operators and modifiers. The main purpose of this dialect is to
facilitate communication between clients and servers. We do not expect normal ADS
users to know the specialized syntax, but we want to give them tools to execute complex
queries if they ever have a need to do so. More likely though, this query language will
be used by clients (programs) that search the ADS on behalf of users. To illustrate the
point, here is a short list of a few specialized modifiers/operators:

• =searchterm - the equals sign modifier is perhaps known to ADS users; it
disables the synonym expansion so that the search engine will not look for synonyms
or alternate spellings.

• references(author:einstein) and citations(author:einstein) - these operators
retrieve the papers that are cited by a collection of papers, or those that cite
that collection, respectively. Operators can be freely nested and combined with other
terms, even if that means that you are going to analyze the entire ADS corpus, e.g.
citations(references(author:einstein))

• useful(), instructive(), trending() and other operators - the ADS search engine
supports so-called second order queries which use the citation and readership
network to provide advanced search capabilities. Currently these operators use
multiple algorithms to retrieve papers that are most cited by papers on a topic
(useful), citing the most relevant papers on a topic (instructive), or most read by
people interested in a topic (trending). The ADS will be experimenting with
different implementations of these operators and will be adding more of them as
needed.

Figure 1. Screenshot of the new UI for the ADS, using normal search.
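
The notion of a second order query can be illustrated with a toy Python version of the "useful" idea, i.e. finding the papers most cited by a set of papers on a topic. The citation data and the function name `most_cited_by` are invented for this sketch, and the simple counting shown here is an assumption; it should not be read as the algorithm the ADS actually uses.

```python
from collections import Counter


def most_cited_by(topic_papers, references, top_n=3):
    """Rank papers by how many of the topic papers cite them."""
    counts = Counter()
    for paper in topic_papers:
        # Use a set so that a paper citing the same work twice counts once.
        counts.update(set(references.get(paper, ())))
    return [paper for paper, _ in counts.most_common(top_n)]


# Hypothetical reference lists: paper -> papers it cites.
references = {
    "A": ["X", "Y"],
    "B": ["X"],
    "C": ["X", "Y", "Z"],
}

# First order result: the papers matching a topic query (assumed here).
topic_papers = ["A", "B", "C"]
print(most_cited_by(topic_papers, references, top_n=2))
```

The first order query selects the topic papers; the second order step then ranks other papers using the citation network among those results.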

The citation operators together with custom analytical functions are potentially
the most fun part of the new search engine. It is possible to construct custom searches
just for the purpose of supporting certain visualizations (e.g. a view of a co-readership
network) or use cases (e.g. selecting the Nth token in a list of author names). In the
future, we expect users of the API to come up with novel ways of using them, something
we did not anticipate when they were created.^4

3.2. Other services 

The best example of the API in action is the new user interface for the ADS,
code-named Bumblebee. It is a rich JavaScript client that runs in your web browser. It is
the future replacement of the ADS Abstract service, but unlike that one, virtually every
interaction happens through the API. So without knowing it, when you visit
http://ui.adslabs.org, you automatically become an API user and access multiple
web services. It will not feel any different, but the difference in implementation is
significant: to have access to the API means that you are a first-class citizen of the ADS
world, accessing the same interface that ADS developers use every day.

At the time of this writing, there exist multiple services besides search:

• metrics: the familiar citation and usage reports (see Figure 2). If you access the API
directly, you can download the data that are used to generate the metrics views
(and build your own version of the visualization).

• visualizations: this service exposes different visualizations based on the relations
between authors and papers, e.g. word clouds, paper networks, citation networks,
and bubble charts of search results (see Figure 3).

• ORCID: a service that allows users to claim authorship through the ADS interface
into the ORCID database.^5

Figure 2. Citation metrics report for the most cited 100 papers of A. Einstein.

^4 Of course we are aware that in order for this to happen we need to provide better documentation of the
search functionality.

Information on how to access and use the API for these services, and any future
ones, is available through: http://adsabs.github.io/help


4. Future directions 

We are aware that the new API has a number of shortcomings, especially with respect
to documentation. It is still a very young component and it may
not be a smooth ride for external developers, but the ADS has already come a long
way in making this new "mode" available and will continue improving it. In the past
year, the ADS has re-engineered many of its subsystems from the ground up (after a
few false starts). The current implementation has shown that the system architecture
is robust, and this gives us confidence to keep going in this direction. To offer a level
of stability to the platform, major changes to the API will be versioned so that we
will be able to provide backward compatibility to applications built on it. Our hope is
that ultimately this open API platform will attract development from outside the ADS
group, leveraging the potential hidden in the wider community.

^5 ORCID stands for Open Researcher and Contributor ID: http://orcid.org

Figure 3. Paper networks inside the first 100 most cited papers of A. Einstein.

The new generations of astronomers and data scientists are technically very savvy 
and ready to embrace emerging technologies to conduct their research. We hope that 
they will appreciate the level of access the new API provides, and that they will take 
the opportunity to build something that satisfies their research needs (and share it with 
other people). We hope that somebody will come up with ideas we did not have the 
time or acuity to consider, new ways to explore the vast data indexed in ADS, or new 
ways to connect existing systems and give insights that were not previously possible. 
We will continue building and adding more services, improving the relevancy of the
search in the near future, and preparing the new user interface to replace the existing
ADS Abstract Service. Once the new user interface is rolled out, any new contributions
to it will become visible to at least a million users every month, which will hopefully
provide enough usability data to test even more of the new ideas. Intrigued and want to see
more or perhaps even build something? All of the code is published open-source and
available through the ADS GitHub repository: http://github.com/adsabs We welcome
any and all comments, criticisms and contributions.